A recent study has found that scientific citations generated by ChatGPT often do not correspond to real academic work. The study, published in the Canadian Psychological Association’s Mind Pad, reported “false citation rates” ranging from 6% to 60% across psychology subfields. Surprisingly, these fabricated citations include elements such as legitimate researchers’ names and properly formatted digital object identifiers (DOIs), which could easily mislead both students and researchers.
ChatGPT is an artificial intelligence language model developed by OpenAI that generates human-like text in response to the input it receives. As part of the larger GPT (Generative Pre-trained Transformer) series, it has been trained on a vast amount of text data, allowing it to produce coherent responses across a wide range of topics. This capability also presents challenges, especially in contexts that demand high accuracy and reliability, such as academic writing.
As AI tools like ChatGPT become more accessible and widely used, concern is growing about their implications for academic integrity. Specifically, the tool’s tendency to “hallucinate” information, for example by generating plausible but non-existent citations, poses a significant risk.
“Initially, I was interested in finding ways of identifying ChatGPT usage in student work that I was grading. When ChatGPT was released, I noticed more and more students talking about using ChatGPT and how to use it without being caught,” explained study author Jordan MacDonald, a PhD student in Experimental Psychology at the University of New Brunswick–Saint John.
“I took this as a challenge and started having ChatGPT prepare me papers on various topics to see what, if any, errors were produced consistently. That was when I noticed that a lot of the references that ChatGPT cited did not actually exist.”
“Hallucinated citations are easy to spot because they often contain real authors, journals, proper issue/volume numbers that match up with the date of publication, and DOIs that appear legitimate. However, when you examine hallucinated citations more closely, you will find that they are referring to work that does not exist.”
“The only alternative to a large language model generating these citations is that someone manually collected real authors, real journal names (along with issue and volume numbers), made up a fake title, and then constructed a fake DOI (which have a specific format and usually look like this: 10.1177/03057356211030985). The work it would take to pull together a fake citation would exceed the work it would take to just find a real one and do the work yourself.”
To investigate the accuracy of citations generated by artificial intelligence, MacDonald tasked ChatGPT 3.5 with generating 50 citations for each of six psychology subfields (religion, animal, social, clinical, personality, and neuropsychology), for a total of 300 citations.
The authenticity of these citations was verified by checking their digital object identifiers (DOIs) against actual publications. If a DOI did not lead to a real document, it was marked as a hallucinated citation. MacDonald further scrutinized a random selection of both hallucinated and legitimate citations to investigate discrepancies in detail.
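A check of this kind can be automated. The Python sketch below is not MacDonald’s procedure, which is described only as following each DOI to see whether it leads to a real document. It assumes that the public doi.org handle lookup endpoint answers with a `responseCode` of 1 for registered DOIs and an HTTP error for unregistered ones; the names `DOI_PATTERN`, `doi_resolves`, and `split_citations` are illustrative.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request

# Loose syntactic check (an assumption, not the full DOI specification):
# "10.", a 4-9 digit registrant prefix, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_resolves(doi, timeout=10.0):
    """Ask the doi.org handle lookup whether a DOI is registered (needs network)."""
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response).get("responseCode") == 1
    except urllib.error.HTTPError:
        # Assumed behaviour: unregistered DOIs come back as an HTTP error.
        return False


def split_citations(citations, resolver=doi_resolves):
    """Split (label, doi) pairs into (resolved, hallucinated) label lists."""
    resolved, hallucinated = [], []
    for label, doi in citations:
        # A malformed DOI is never sent to the resolver.
        ok = bool(DOI_PATTERN.match(doi)) and resolver(doi)
        (resolved if ok else hallucinated).append(label)
    return resolved, hallucinated
```

Passing a stand-in resolver, for example `split_citations(pairs, resolver=lambda doi: doi in known_dois)`, lets the sorting logic be tried without network access. A DOI that resolves shows only that the identifier is registered, not that the citation attached to it is accurate.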
MacDonald found that a total of 32.3% of the 300 citations generated by ChatGPT were hallucinated. Despite being fabricated, these hallucinated citations were constructed with elements that appeared legitimate — such as real authors who are recognized in their respective fields, properly formatted DOIs, and references to legitimate peer-reviewed journals.
The hallucination rate varied by subfield. For instance, ChatGPT hallucinated only three of its 50 neuropsychology citations (6%) but 30 of its 50 psychology of religion citations (60%).
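The reported figures fit together arithmetically, as the short Python sketch below shows. It uses only the two subfield counts given here and the study design of 50 citations per subfield; the overall count of hallucinated citations is inferred from the 32.3% figure and is not a number reported in this article. The names `rate_percent` and `implied_count` are illustrative.

```python
CITATIONS_PER_SUBFIELD = 50   # from the study design
TOTAL_CITATIONS = 300         # six subfields x 50 citations

# Only these two per-subfield counts are reported in the article.
reported_hallucinated = {"neuropsychology": 3, "psychology of religion": 30}


def rate_percent(hallucinated, total):
    """Share of citations that were hallucinated, as a percentage."""
    return 100.0 * hallucinated / total


for subfield, count in reported_hallucinated.items():
    print(f"{subfield}: {rate_percent(count, CITATIONS_PER_SUBFIELD):.0f}%")

# Inferred, not reported: the whole-number count closest to 32.3% of 300.
implied_count = round(0.323 * TOTAL_CITATIONS)
print(implied_count, f"{rate_percent(implied_count, TOTAL_CITATIONS):.1f}%")
```

The two subfield counts reproduce the 6% and 60% ends of the reported range, and the overall rate implies roughly 97 hallucinated citations out of 300.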
Interestingly, even when citations included legitimate DOIs that correctly redirected to real articles, MacDonald’s closer inspection often revealed mismatches. The cited articles did not always correspond with the titles, authors, or subjects provided by ChatGPT. For example, a DOI might lead to a genuine article on a completely different topic than the one ChatGPT described.
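MacDonald found these mismatches by inspecting the articles directly. As a rough illustration of how a similar comparison might be screened automatically, the Python sketch below flags a citation when the title ChatGPT supplied differs substantially from the title registered for the DOI. It is an assumption-laden sketch: the registered title is taken as an input (looking it up is not shown), the 0.8 threshold is arbitrary, and the function names are hypothetical.

```python
import difflib
import re


def normalize(title):
    """Lowercase a title and drop punctuation and extra spacing."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def title_similarity(claimed, registered):
    """Similarity between two titles, from 0.0 (unrelated) to 1.0 (identical)."""
    matcher = difflib.SequenceMatcher(None, normalize(claimed), normalize(registered))
    return matcher.ratio()


def is_mismatch(claimed, registered, threshold=0.8):
    """Flag a citation whose claimed title is unlike the registered one."""
    return title_similarity(claimed, registered) < threshold
```

A flagged citation would still need a human check, since subtitles, translations, or abbreviated titles can lower the score for a citation that is in fact correct.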
“The degree of hallucination surprised me,” MacDonald told PsyPost. “Almost every single citation had hallucinated elements or were just entirely fake, but ChatGPT would offer summaries of this fake research that was convincing and well worded.”
“As ChatGPT becomes more refined, I imagine this error will become less common, but as far as I am aware, citation and information hallucination is a tricky beast to tackle when developing language models. At the very least, hallucinated citations are both easy to identify and a likely indicator of ChatGPT (or other large language model) usage.”
Additionally, MacDonald observed that ChatGPT could accurately summarize scholarly articles if provided with correct and complete references by the user. However, left to its own devices, the model frequently “hallucinated” both the content and context of the citations.
“I think many people are both concerned and excited for the potential upsides and downsides to ChatGPT,” MacDonald said. “One of the upsides is that ChatGPT can be used by those who are well educated in a given field to do very topical literature scans. Someone who knows their field well may be able to use ChatGPT in an advantageous way, while also being able to catch errors.
“The downside, and the other end of that same stick, is that students and the general population might use ChatGPT to provide them with information on a topic while lacking the knowledge of said topic to be able to identify false or misleading information.”
“ChatGPT and other large language models definitely have many benefits but are clearly still in their infancy,” MacDonald explained. “I think the average person should be very cautious about using ChatGPT in the same way that they should be cautious about getting a cancer diagnosis from Dr. Google.”
“I think that educators should know that invalid references appear to be a reasonable way to identify AI-generated work but it is not a smoking gun, either. Students may use ChatGPT to help with an initial literature search and then write a paper on their own. The degree of wrongdoing may vary.”
The study has some caveats. Its scope was limited to one version of ChatGPT and a specific set of psychology subfields, and because AI models are developed rapidly, newer versions of ChatGPT may not exhibit the same patterns of hallucinated citations.
“ChatGPT is evolving and my findings may not be accurate to the same extent in future versions,” MacDonald noted, adding that “this is not my main field of research but I intend on continuing to find ways to identify plagiarism or academic misconduct using ChatGPT or other large language models. I hope to see these models trained in a way that can prevent students from abusing them.”
The study, “Dude, Where’s My Citations? ChatGPT’s Hallucination of Citations,” was published in the Winter 2023 issue of Mind Pad.